\documentclass[a4paper,11pt,english]{article}

\usepackage{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}

\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
  bookmarksopen=true,
  bookmarksnumbered=true,
  pdftitle={Bayesian data analysis},
  pdfsubject={Comments},
  pdfauthor={Aki Vehtari},
  pdfkeywords={Bayesian probability theory, Bayesian inference, Bayesian data analysis},
  pdfstartview={FitH -32768},
  colorlinks=true,
  linkcolor=black,
  citecolor=black,
  filecolor=black,
  urlcolor=black
}


% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm

%\parskip=\baselineskip

\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\betacdf}{betacdf}
\DeclareMathOperator{\pbeta}{pbeta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}

\pagestyle{empty}

\begin{document}
\thispagestyle{empty}

\section*{Bayesian data analysis -- reading instructions 2} 
\smallskip
{\bf Aki Vehtari}
\smallskip

\subsection*{Chapter 2 -- outline}

Outline of the chapter 2
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item 2.1 Binomial model (e.g. biased coin flipping)
\item 2.2 Posterior as compromise between data and prior information
\item 2.3 Posterior summaries
\item 2.4 Informative prior distributions (skip exponential families and sufficient statistics)
\item 2.5 Gaussian model with known variance
\item 2.6 Other single parameter models
  \begin{list}{-}{\parsep=0pt\itemsep=2pt}
  \item in this course the normal distribution with known mean but
    unknwon variance is the most important
  \item glance through Poisson and exponential
  \end{list}
\item 2.7 glance through this example, which illustrates benefits of prior information, no need to read all the details (it's quite long example)
\item 2.8 Noninformative and weakly informative priors
\end{list}

Laplace's approach for approximating integrals is discussed in more
detail in Chapter 4.

R and Python demos (\url{https://github.com/avehtari/BDA_R_demos} and \url{https://github.com/avehtari/BDA_py_demos})
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item demo2\_1: Binomial model and Beta posterior.
\item demo2\_2: Comparison of posterior distributions with different
parameter values for the Beta prior distribution.
\item demo2\_3: Use samples to plot histogram with quantiles, and
  the same for a transformed variable.
\item demo2\_4: Grid sampling using inverse-cdf method.
\end{list}

\subsection*{Chapter 2 -- most important terms}

Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others. See also the additional comments below.
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item binomial model
\item Bernoulli trial
\item exchangeability
\item $\Bin$, $\binom{n}{y}$
\item Laplace's law of succession
\item think which expectations in eqs. 2.7-2.8
\item summarizing posterior inference
\item mode, mean, median, standard deviation, variance, quantile
\item central posterior interval
\item highest posterior density interval / region
\item uninformative / informative prior distribution
\item principle of insufficient reason
\item hyperparameter
\item conjugacy, conjugate family, conjugate prior distribution, natural conjugate prior
\item nonconjugate prior
\item normal distribution
\item conjugate prior for mean of normal distribution with known variance
\item posterior for mean of normal distribution with known variance
\item precision
\item posterior predictive distribution
\item normal model with known mean but unknown variance
\item proper and improper prior
\item unnormalized density
\item Jeffreys' invariance principle
\item note non-uniqueness of noniformative priors for the binomial parameter
\item difficulties with noninformative priors
\item weakly informative priors
\end{list}

\subsection*{Integration over Beta distribution}

Chapter 2 has an example of analysing the ratio of girls born in Paris
1745--1770.  Laplace used binomial model and uniform prior which
produces Beta distribution as posterior distribution. Laplace wanted
to calculate $p(\theta \geq 0.5)$, which is obtained as
\begin{eqnarray*}
  p(\theta \geq 0.5) &=& \int_{0.5}^1  p(\mathbf{\theta}|y,n,M) d\theta \\
  &=& \frac{493473!}{241945!251527!} \int_{0.5}^1 \theta^y(1-\theta)^{n-y} d\theta
\end{eqnarray*}
Note that $\Gamma(n)=(n-1)!$.
%
Integral has a form which is called \emph{incomplete Beta function}.
Bayes and Laplace had difficulties in computing this, but nowadays
there are several series and continued fraction expressions.
%
% For example, betacdf function in Matlab first tries a continued
% fraction expression and if it diverges, then an alternative
% approximation.
Furthermore usually the normalisation term is computed by computing
$\log(\Gamma(\cdot))$ directly without explicitly computing
$\Gamma(\cdot)$.
%
Bayes was able to solve integral given small $n$ and $y$.
%
In case of large $n$ and $y$, Laplace used Gaussian approximation of
the posterior (more in chapter 4).
%
In this specific case, R pbeta gives the same results as Laplace's
result with at least 3 digit accuracy.

\subsection*{Numerical accuracy}

Laplace calculated
\begin{equation*}
  p(\theta \geq 0.5 | y, n, M) \approx 1.15 \times 10^{-42}.
\end{equation*}
Correspondingly Laplace could have calculated
\begin{equation*}
  p(\theta \geq 0.5 | y, n, M) = 1 - p(\theta \leq 0.5 | y, n, M),
\end{equation*}
which in theory could be computed in R with
{\tt 1-pbeta(0.5,y+1,n-y+1)}.  In practice this fails, due to the
limitation in the floating point representation used by the computers.
In R the largest floating point number which is smaller than 1 is
about 1-eps/4, where eps is about $2.22 \times 10^{-16}$ (the smallest
floating point number larger than 1 is 1+eps). Thus the result from
{\tt pbeta(0.5,y+1,n-y+1)} will be rounded to 1 and $1-1=0\neq 1.15
\times 10^{-42}$. We can compute $p(\theta \geq 0.5 | y, n, M)$ in R with {\tt pbeta(0.5, y+1, n-y+1, lower.tail=FALSE)}.

\subsection*{Highest Posterior Density interval}

HPD interval is not invariant to reparametrization. Here's an illustrative
example (using R and package {\tt HDInterval}):
 {\small
\begin{verbatim}
> r <- exp(rnorm(1000))
> quantile(log(r),c(.05, .95))
       5%       95% 
-1.532931  1.655137 
> log(quantile(r,c(.05, .95)))
       5%       95% 
-1.532925  1.655139 
> hdi(log(r), credMass = 0.9)
    lower     upper 
-1.449125  1.739169 
attr(,"credMass")
[1] 0.9
> log(hdi(r, credMass = 0.9))
    lower     upper 
-2.607574  1.318569 
attr(,"credMass")
[1] 0.9
\end{verbatim}
}

\subsection*{Gaussian distribution in more complex models and methods}

Gaussian distribution is commonly used in mixture models, hierarchical
models, hierarchical prior structures, scale mixture distributions,
Gaussian latent variable models, Gaussian processes, Gaussian random
Markov fields, Kalman filters, proposal distribution in Monte Carlo
methods, etc.

\subsection*{Predictive distribution}

Often the predictive distribution is more interesting than the
posterior distribution. The posterior distribution describes the
uncertainty in the parameters (like the proportion of red chips in the
bag), but the predictive distribution describes also the uncertainty
about the future event (like which color is picked next). This
difference is important, for example, if we want to what could happen
if some treatment is given to a patient.

In case of Gaussian distribution with known variance $\sigma^2$ the model is
\begin{align*}
  y\sim \N(\theta,\sigma^2),
\end{align*}
where $\sigma^2$ describes aleatoric uncertainty.
Using uniform prior the posterior is 
\begin{align*}
  p(\theta|y) \sim \N(\theta|\bar{y},\sigma^2/n),
\end{align*}
where $\sigma^2/n$ described epistemic uncertainty related to $\theta$.
Using uniform prior the posterior predictive distribution is 
\begin{align*}
  p(\theta|y) \sim \N(\theta|\bar{y},\sigma^2+\sigma^2/n),
\end{align*}
where the uncertainty is sum of epistemic ($\sigma^2/n$) and aleatoric
uncertainty ($\sigma^2$).

\subsection*{Weakly informative priors}

Our thinking has advanced since section 2.9 was written. See the Prior
Choice Wiki
(\url{https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations}
for more recent general discussion and model specific recommendations.

\subsection*{Should we worry about rigged priors?}

Andrew Gelman's blog post answering worries that data analyst would
choose a too optimistic prior
\url{http://andrewgelman.com/2017/10/04/worry-rigged-priors/}.


\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End:
